09. RNN Training


Training vs. Testing

During training, we have a true caption, which is fixed: at each time step, the ground-truth previous word is fed to the decoder as input. During testing, the caption is being actively generated, starting with the <start> token; at each step, the most likely next word is taken from the output and used as the input to the next LSTM cell.
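
To make this concrete, below is a minimal sketch of training-time decoding in PyTorch, where the fixed, ground-truth caption is embedded and fed to the LSTM all at once (so each step sees the true previous word, not the model's own prediction). The class name, layer sizes, and argument names here are illustrative assumptions, not the lesson's exact code.

import torch
import torch.nn as nn

class DecoderRNN(nn.Module):
    def __init__(self, embed_size=256, hidden_size=512, vocab_size=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # features: (batch, embed_size) image features from the CNN encoder
        # captions: (batch, seq_len) ground-truth word indices for the true caption
        # Drop the last word: there is nothing to predict after <end>.
        embeddings = self.embed(captions[:, :-1])           # (batch, seq_len-1, embed_size)
        # Prepend the image features as the first "word" in the input sequence.
        inputs = torch.cat([features.unsqueeze(1), embeddings], dim=1)
        hiddens, _ = self.lstm(inputs)                       # (batch, seq_len, hidden_size)
        return self.fc(hiddens)                              # scores over the vocabulary at each step

During training, these output scores are compared against the true caption with a cross-entropy loss; the ground-truth words, not the predicted ones, keep driving the inputs.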

Caption Generation, Test Data

After the CNN encodes a new, test image, the decoder should first produce the <start> token, then an output distribution at each time step that indicates the most likely next word in the sentence. We can take the most likely word from that distribution (a greedy choice, often loosely called sampling) to get the next word in the caption, and keep this process going until we reach another special token, <end>, which indicates that we are done generating the sentence.
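
Below is a minimal sketch of that greedy generation loop, assuming a decoder with embed, lstm, and fc layers like the training sketch above; the end_idx and max_len values are illustrative assumptions, not part of the original lesson.

import torch

def sample(decoder, features, end_idx=1, max_len=20):
    # features: (1, embed_size) image features for a single test image.
    # Returns a list of predicted word indices, e.g. [<start>, ..., <end>].
    caption = []
    inputs = features.unsqueeze(1)        # (1, 1, embed_size) — first LSTM input
    states = None                         # hidden and cell states start at zero
    for _ in range(max_len):
        hiddens, states = decoder.lstm(inputs, states)   # one time step
        scores = decoder.fc(hiddens.squeeze(1))          # (1, vocab_size)
        predicted = scores.argmax(dim=1)                 # most likely next word
        caption.append(predicted.item())
        if predicted.item() == end_idx:                  # stop once <end> is produced
            break
        inputs = decoder.embed(predicted).unsqueeze(1)   # feed the prediction back in
    return caption

The key difference from training is the last line of the loop: the word the model just predicted is embedded and fed back in as the next input, rather than the ground-truth word, and the loop stops as soon as <end> appears (or a maximum length is reached).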